Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All sector_classification datasets have no duplicate values of code column #327

Merged
merged 19 commits into from
Feb 12, 2024

Conversation

jdhoffa
Copy link
Member

@jdhoffa jdhoffa commented Feb 5, 2024

Closes #229

@jdhoffa
Copy link
Member Author

jdhoffa commented Feb 5, 2024

Looking into this now, and there are some issues:
For the cnb_classification file, there is a code that is mapped to a whole bunch of different sectors. My understanding is that this should not be allowed in the Banks code, as it is an indication that the "code level" is not granular enough to identify the correct sector.

Screenshot 2024-02-05 at 15 36 52

@jdhoffa
Copy link
Member Author

jdhoffa commented Feb 5, 2024

A similar issue presents itself in the gics_classification file:
Screenshot 2024-02-05 at 15 43 57

@jdhoffa
Copy link
Member Author

jdhoffa commented Feb 5, 2024

Similarly for isic_classification, there are many sectors that are erroneously tied to two different sectors. We need to define which sectors are correct here:
Screenshot 2024-02-05 at 15 46 48

@jdhoffa
Copy link
Member Author

jdhoffa commented Feb 5, 2024

nace_classification is the most complicated here.

What happens is that original_code gets mapped to code, which removes the decimal mark. This is a bad idea as it effectively converts different codes (e.g. 6.2 and 62) to the same "formatted" code 620.

This is almost certainly a bug, and should be reverted, however, we do risk affecting the downstream user as it will change the expected sector classification code. I leave it to @jacobvjk and George to determine how risky this change will end up being:

Screenshot 2024-02-05 at 15 48 00

@jdhoffa
Copy link
Member Author

jdhoffa commented Feb 6, 2024

@jdhoffa jdhoffa marked this pull request as ready for review February 7, 2024 14:42
@jdhoffa
Copy link
Member Author

jdhoffa commented Feb 7, 2024

This is ready for review.

Progress on the deprecation of cnb_classification and isic_classification will be tracked in #329

@jdhoffa jdhoffa requested a review from jacobvjk February 7, 2024 14:43
@jdhoffa jdhoffa self-assigned this Feb 7, 2024
@jdhoffa jdhoffa added the bug an unexpected problem or unintended behavior label Feb 7, 2024
Copy link
Member

@jacobvjk jacobvjk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be some stray code that I could not see used anywhere else in the repo. See comment. Maybe remove?

@@ -2,15 +2,51 @@ library(usethis)

source(file.path("data-raw", "utils.R"))

letter_to_number <- Vectorize(letter_to_number)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not seem to be used anywhere

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

Copy link
Member

@jacobvjk jacobvjk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change ☠️ API change likely to affect existing code bug an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nace_classification has duplicate values
2 participants